Abstract

This paper deals with supervised classification when the response variable is binary and its class
distribution is unbalanced. In such a situation, standard methods such as logistic regression, classification
trees and discriminant analysis do not make it possible to build a powerful classifier. To overcome this
shortcoming of methods that yield classifiers with low sensitivity, we tackle the classification problem
through an approach based on association rule learning, which has the advantage of identifying the patterns
that are well correlated with the target class. Association rule learning is a well-known method in the area
of data mining. It is used on large databases for the unsupervised discovery of local patterns that express
hidden relationships between variables. By considering association rules from a supervised learning point of
view, a relevant set of weak classifiers is obtained, from which one derives a classification rule that
performs well.



1 Introduction 

This paper deals with supervised classification when the response variable is binary and its class distribution is
unbalanced. In such a situation, standard methods such as logistic regression [12], classification trees, discriminant
analysis, etc. [10] do not make it possible to build an efficient classification function. They tend to focus on the
prevalent class and to ignore the other one. Several works have been devoted to this subject, even in the recent past,
from the conventional statistical point of view as well as from that of machine learning. Some of them consider improving
the fit of regression models to produce a classification function with a small prediction bias without losing interesting
features of the standard methods, such as the ability to evaluate the contribution of each covariate to the variations of
the target class probability (regression methods) or the identification of risk patterns (tree methods). Alternative
approaches consist of aggregation techniques like boosting and bagging [7], which combine multiple classification functions
with large individual error rates to produce a new classification function with a smaller error rate.

Our aim is to propose a statistical learning method that provides an efficient classifier and allows the identification
of relevant patterns correlated with the response. To achieve this goal we took the route of association rule learning,
which is a well-known method in the area of data mining. It is used on large databases for the unsupervised discovery of
local patterns that express hidden and potentially valuable relationships between input variables. By considering
association rules from a supervised statistical learning point of view, a relevant set of weak classifiers is obtained,
from which one derives a classification function that performs well. Such an approach is not actually new since it has
already been considered in the machine learning literature [8]. In the present work we aim at inserting it within the
traditional framework of statistics and showing its relevance by applying it to a real-world problem.



2 An overview on association rules 



2.1 Background: basic definitions and notations 



Let m be an integer greater than 1 and let $1:m$ denote the set of all integers from 1 to m. Let $(A_h)_{h \in 1:m}$ denote
a set of m attributes that describe the elements of a population $\Omega$, each attribute $A_h$ being evaluated on a
non-numerical scale made of $q_h$ levels $(a_{h,j})_{j \in 1:q_h}$. A transaction is a sample unit $t \in \Omega$.

Definition 1. An item is a binary variable $A_{h,j(h)}$ such that $A_{h,j(h)} = 1$ if and only if $A_h = a_{h,j(h)}$.
An itemset is a collection of items $(A_{h,j(h)})_{h \in I}$ where $I \subset 1:m$.

The length of the itemset $(A_{h,j(h)})_{h \in I}$ is equal to the size of the set $I \subset 1:m$, and a k-itemset is an
itemset of length k.

Two itemsets $(A_{h,j(h)})_{h \in I}$ and $(A_{h,j(h)})_{h \in J}$ are disjoint itemsets if $I$ and $J$ are disjoint subsets of $1:m$.

The support of the itemset $(A_{h,j(h)})_{h \in I}$ is the probability $\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 1\right\}$.



From a statistical point of view, an itemset $(A_{h,j(h)})_{h \in I}$ can be understood as the expression of an interaction
between the categorical variables $A_h$, $h \in I$, that it is made of, and hence the event
$\left[\prod_{h \in I} A_{h,j(h)} = 1\right]$ is a relevant pattern if it has a significant probability of occurrence.



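Definition 2. Let's consider two disjoint itemsets $X = (A_{h,j(h)})_{h \in I}$ and $Y = (A_{h,j(h)})_{h \in J}$. An
association rule is an implication of the form $X \to Y$ meaning that the probabilities

$$\Pr\left\{\prod_{h \in I \cup J} A_{h,j(h)} = 1\right\} \quad\text{and}\quad \Pr\left\{\prod_{h \in J} A_{h,j(h)} = 1 \;\Big|\; \prod_{h \in I} A_{h,j(h)} = 1\right\}$$

are significant.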



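An association rule $X \to Y$ expresses the fact that not only is there a high probability that the events
$\left[\prod_{h \in I} A_{h,j(h)} = 1\right]$ and $\left[\prod_{h \in J} A_{h,j(h)} = 1\right]$ occur simultaneously, but also
that $\left[\prod_{h \in J} A_{h,j(h)} = 1\right]$ has a high probability of occurrence under the conditions specified by the
event $\left[\prod_{h \in I} A_{h,j(h)} = 1\right]$.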



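Definition 3. Let's consider an association rule $X \to Y$ where $X = (A_{h,j(h)})_{h \in I}$ and $Y = (A_{h,j(h)})_{h \in J}$.
The probability $\Pr\left\{\prod_{h \in I \cup J} A_{h,j(h)} = 1\right\}$ is called the support of the association rule and
the conditional probability $\Pr\left\{\prod_{h \in J} A_{h,j(h)} = 1 \mid \prod_{h \in I} A_{h,j(h)} = 1\right\}$ is its
confidence.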



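The probability $\Pr\left\{\prod_{h \in I \cup J} A_{h,j(h)} = 1\right\}$ tells us how frequently the event
$\left[\prod_{h \in I \cup J} A_{h,j(h)} = 1\right]$ occurs, while the conditional probability
$\Pr\left\{\prod_{h \in J} A_{h,j(h)} = 1 \mid \prod_{h \in I} A_{h,j(h)} = 1\right\}$ measures the correlation between the
events $\left[\prod_{h \in I} A_{h,j(h)} = 1\right]$ and $\left[\prod_{h \in J} A_{h,j(h)} = 1\right]$.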
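The support and the confidence of a rule can be estimated by simple counting. The following is a minimal sketch, assuming each transaction is encoded as the set of its items that are equal to 1; the function names and the toy data are hypothetical and serve only as an illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate Pr(rhs | lhs) as support(lhs and rhs) / support(lhs).

    Assumes that at least one transaction contains `lhs`.
    """
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    return support(transactions, lhs | rhs) / support(transactions, lhs)


# Hypothetical toy data: four transactions over two items.
toy = [
    frozenset({"referral", "haemorrhage"}),
    frozenset({"referral", "haemorrhage"}),
    frozenset({"referral"}),
    frozenset({"haemorrhage"}),
]
rule_support = support(toy, {"referral", "haemorrhage"})
rule_confidence = confidence(toy, {"referral"}, {"haemorrhage"})
```

Here the rule "referral implies haemorrhage" has support 2/4 and confidence 2/3 on the toy data.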



2.2 Mining association rules in a large database

Apriori is one of the most widely implemented association rule mining algorithms; it pioneered the use of support-based
pruning to systematically control the exponential growth of candidate rules. The pseudocode for the frequent itemset
generation part of the Apriori algorithm is given hereafter. Let $C_k$ denote the set of candidate k-itemsets, $T$ the
set of all transactions and $F_k$ the set of frequent k-itemsets:
Algorithm: Frequent rules generation of the Apriori algorithm

 1: k = 1
 2: F_k = {Find all frequent 1-itemsets}
 3: repeat
 4:    k = k + 1
 5:    C_k = apriori-gen(F_{k-1})                  {Generate candidate itemsets}
 6:    for each transaction t ∈ T do
 7:       C_t = subset(C_k, t)                     {Identify all candidates that belong to t}
 8:       for each candidate itemset c ∈ C_t do
 9:          supp(c) ← supp(c) + 1                 {Increment support count}
10:          if t.class = c.class then
11:             conf(c) ← conf(c) + 1              {Increment confidence count}
12:          end if
13:       end for
14:    end for
15:    F_k = {c ∈ C_k | supp(c) ≥ minsupp ; conf(c)/supp(c) ≥ minconf}
                                                   {Extract the frequent rules of length k}
16: until F_k = ∅
17: Result = ∪_k F_k



• First the algorithm makes a single pass over the dataset to determine the support of each item. Upon completion
of this step, the set of all frequent 1-itemsets, $F_1$, will be known (steps 1 and 2).

• Next, the algorithm iteratively generates new candidate k-itemsets using the frequent (k - 1)-itemsets found
in the previous iteration (step 5). Candidate generation is implemented using a function called apriori-gen, which
also eliminates some of the candidate k-itemsets using the support-based pruning strategy.

• To compute the support and the confidence of the candidate rules, the algorithm needs to make an additional
pass over the dataset (steps 6-14). The subset function is used to determine all the candidate itemsets in $C_k$ that
are contained in each transaction t.

• After counting their supports and confidences, the algorithm eliminates all candidate rules whose support is
less than minsupp or whose confidence is less than minconf (step 15).

• The algorithm ends when no new frequent itemsets are generated, i.e., $F_k = \emptyset$ (step 16).
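The steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in this paper: transactions are encoded as sets of items, `minsupp` is a minimum count of at least 1, the candidate generation is a naive join-and-prune version of apriori-gen, and the function name `mine_class_rules` and the toy data are hypothetical.

```python
from itertools import combinations


def mine_class_rules(transactions, labels, minsupp, minconf, max_len=3):
    """Level-wise search for rules `itemset -> target class`.

    transactions: list of frozensets of items (hypothetical encoding);
    labels: list of 0/1 class indicators; minsupp: minimum support count
    (at least 1); minconf: minimum value of conf(c) / supp(c).
    Returns {itemset: (support count, confidence)}.
    """
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    rules = {}
    k = 1
    while candidates and k <= max_len:
        supp = dict.fromkeys(candidates, 0)
        conf = dict.fromkeys(candidates, 0)
        for t, y in zip(transactions, labels):   # one pass over the data
            for c in candidates:
                if c <= t:                       # candidate contained in t
                    supp[c] += 1
                    if y == 1:
                        conf[c] += 1
        # step 15: keep the candidates that meet both thresholds
        frequent = [c for c in candidates
                    if supp[c] >= minsupp and conf[c] >= minconf * supp[c]]
        for c in frequent:
            rules[c] = (supp[c], conf[c] / supp[c])
        # naive apriori-gen: join frequent itemsets, then prune any
        # candidate that has a subset of size k - 1 which is not frequent
        k += 1
        kept = set(frequent)
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return rules


# Hypothetical toy data: five transactions over the items a, b, c.
toy_transactions = [frozenset("ab"), frozenset("ab"), frozenset("a"),
                    frozenset("b"), frozenset("c")]
toy_labels = [1, 1, 0, 0, 0]
toy_rules = mine_class_rules(toy_transactions, toy_labels, minsupp=2, minconf=0.6)
```

On the toy data the rules retained are {a}, {b} and {a, b}; the last one has support count 2 and confidence 1.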

3 Class association rules set as supervised association rules learning 
framework 

Since our aim is to specify a classification function on the basis of a statistical analysis of the data, we will
focus on a special case of association rules by constraining their right-hand side itemset to be reduced to the
classification target class indicator.



Definition 4. Let Y be the classification target class indicator. A class association rule is an association rule of the
form $X \to Y$ where $X = (A_{h,j(h)})_{h \in I}$ is an itemset disjoint with Y.



Let's consider the classification function defined as $\phi(t) = \prod_{h \in I} A_{h,j(h)}(t)$ where $t \in \Omega$ is a
transaction and $X = (A_{h,j(h)})_{h \in I}$ is a given itemset. The true-positive rate (tpr) of this rule is equal to
$\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 1 \mid Y = 1\right\}$, its false-positive rate (fpr) is
$\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 1 \mid Y = 0\right\}$, its true-negative rate (tnr) is
$\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 0 \mid Y = 0\right\}$ and its false-negative rate (fnr) is
$\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 0 \mid Y = 1\right\}$.



Definition 5. Let Y be the classification target class indicator and $X = (A_{h,j(h)})_{h \in I}$ be an itemset. The
true-positive rate of the classification function $\phi(t) = \prod_{h \in I} A_{h,j(h)}(t)$ is called the local support of
the class association rule $X \to Y$.
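The four rates attached to a single pattern can be estimated from a sample by counting. The sketch below assumes the pattern is summarised by a 0/1 vector `matches` (the values of the classification function on each transaction) and that both classes are present in the sample; the function name and the toy vectors are hypothetical.

```python
def pattern_rates(matches, labels):
    """Return (tpr, fpr, tnr, fnr) of the classifier defined by one pattern.

    matches[i] is 1 when transaction i fits the itemset X and 0 otherwise;
    labels[i] is the target class indicator Y of transaction i.
    """
    pairs = list(zip(matches, labels))
    tp = sum(1 for m, y in pairs if m == 1 and y == 1)
    fp = sum(1 for m, y in pairs if m == 1 and y == 0)
    fn = sum(1 for m, y in pairs if m == 0 and y == 1)
    tn = sum(1 for m, y in pairs if m == 0 and y == 0)
    positives, negatives = tp + fn, fp + tn
    return tp / positives, fp / negatives, tn / negatives, fn / positives


# Hypothetical toy sample: three target-class records, three others.
toy_matches = [1, 1, 0, 1, 0, 0]
toy_labels = [1, 1, 1, 0, 0, 0]
tpr, fpr, tnr, fnr = pattern_rates(toy_matches, toy_labels)
```

The first returned value, the true-positive rate, is the local support of Definition 5.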



3.1 Risk pattern and relative risk 

When the target class of a binary outcome has a low probability of occurrence, it is usual to refer to the relative risk
statistic to evaluate the correlation between the target class indicator and another binary variable that expresses a
pattern of the population.






Definition 6. Let's consider an itemset $X = (A_{h,j(h)})_{h \in I}$ where $I \subset 1:m$ and denote by Y the indicator of
the target class of a binary outcome. The relative risk of Y given the itemset X is the following ratio of probabilities:

$$RR(X,Y) = \frac{\Pr\left\{Y = 1 \mid \prod_{h \in I} A_{h,j(h)} = 1\right\}}{\Pr\left\{Y = 1 \mid \prod_{h \in I} A_{h,j(h)} = 0\right\}}$$

The itemset $X = (A_{h,j(h)})_{h \in I}$ is a risk pattern for Y if the relative risk $RR(X,Y)$ exceeds a given threshold $r > 1$.
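The relative risk can be estimated from a sample by replacing both conditional probabilities with observed proportions. The sketch below assumes the same 0/1 encoding as in the definition; the function names, the conventions used when the ratio is undefined and the toy data are assumptions made for illustration.

```python
def relative_risk(matches, labels):
    """Estimate RR(X, Y) = Pr(Y=1 | pattern matched) / Pr(Y=1 | not matched)."""
    exposed = sum(matches)
    unexposed = len(matches) - exposed
    if exposed == 0 or unexposed == 0:
        raise ValueError("the pattern must split the sample into two non-empty groups")
    cases_exposed = sum(y for m, y in zip(matches, labels) if m == 1)
    cases_unexposed = sum(y for m, y in zip(matches, labels) if m == 0)
    if cases_unexposed == 0:
        # convention of this sketch only: infinite risk ratio, or NaN for 0/0
        return float("inf") if cases_exposed else float("nan")
    return (cases_exposed / exposed) / (cases_unexposed / unexposed)


def is_risk_pattern(rr, r):
    """A pattern is a risk pattern when its relative risk exceeds r > 1."""
    return rr > r


# Hypothetical toy sample: 4 matching records (2 deaths), 8 others (1 death).
toy_matches = [1, 1, 1, 1] + [0] * 8
toy_labels = [1, 1, 0, 0] + [1] + [0] * 7
toy_rr = relative_risk(toy_matches, toy_labels)
```

With these counts the estimate is (2/4)/(1/8) = 4, so the pattern would be kept for any threshold r below 4.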

3.2 Some tools for the selection of non redundant class association rules 

Processing data with the Apriori algorithm usually produces a huge number of association rules, certainly more than is
necessary to build a classification function that is efficient and easy to implement. Since a small number of relevant
association rules should be enough, one should seek suitable tools to prune, among the rules obtained using the Apriori
algorithm, those that generate very weak classification functions. The purpose of this section is to bring out some basic
principles which could help achieve this task. To this end one will pay attention to the subsets of rules whose risk
patterns are nested.

Definition 7. Let $U = (A_{h,j(h)})_{h \in I}$ and $U' = (A_{h,j(h)})_{h \in J}$ be two itemsets such that $I \subset J$. The
itemset $U'$ is redundant if the classification function generated by the class association rule $U \to Y$ has better
performance measures.



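Let $U = (A_{h,j(h)})_{h \in I}$ and $U' = (A_{h,j(h)})_{h \in J}$ be two itemsets such that $I \subset J$. It comes from the
following inclusions

$$\left[\prod_{h \in J} A_{h,j(h)} = 1\right] \subset \left[\prod_{h \in I} A_{h,j(h)} = 1\right] \quad\text{and}\quad \left[\prod_{h \in I} A_{h,j(h)} = 0\right] \subset \left[\prod_{h \in J} A_{h,j(h)} = 0\right]$$

that

$$\Pr\left\{\prod_{h \in J} A_{h,j(h)} = 1 \;\Big|\; Y = 1\right\} \leq \Pr\left\{\prod_{h \in I} A_{h,j(h)} = 1 \;\Big|\; Y = 1\right\}$$

and

$$\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 0 \;\Big|\; [Y = 1]^c\right\} \leq \Pr\left\{\prod_{h \in J} A_{h,j(h)} = 0 \;\Big|\; [Y = 1]^c\right\}.$$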



Therefore the true-positive rate and the true-negative rate are sorted in opposite ways for the classification functions
generated by two risk patterns if one of them is nested in the other. It is worth noticing that when the true-negative
rates are equal, the classification function generated by the risk pattern with the smallest size is better since its
true-positive rate is the highest. In a similar way, if the true-positive rates are equal, the classification function
generated by the risk pattern with the highest true-negative rate is the best. This provides a criterion for pruning
redundant risk patterns. Moreover one can state:

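Proposition 1. Let $U = (A_{h,j(h)})_{h \in I}$ and $U' = (A_{h,j(h)})_{h \in J}$ be two itemsets such that $I \subset J$ and
both association rules $U \to Y$ and $U' \to Y$ are valid. If

$$\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 1, [Y = 1]^c\right\} = \Pr\left\{\prod_{h \in J} A_{h,j(h)} = 1, [Y = 1]^c\right\}$$

then $RR(U',Y) \leq RR(U,Y)$.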



Proof. For simplicity, let $\phi_U = \prod_{h \in I} A_{h,j(h)}$ and $\phi_{U'} = \prod_{h \in J} A_{h,j(h)}$.
$U \prec U'$ implies $\Pr\{[\phi_{U'} = 1]\} \leq \Pr\{[\phi_U = 1]\}$. On the other hand we have

$$\frac{\Pr\{[\phi_U = 1],[Y = 1]\}}{\Pr\{[\phi_U = 1]\}} = \frac{\Pr\{[\phi_U = 1],[Y = 1]\}}{\Pr\{[\phi_U = 1],[Y = 1]\} + \Pr\{[\phi_U = 1],[Y = 1]^c\}}$$

and

$$\frac{\Pr\{[\phi_{U'} = 1],[Y = 1]\}}{\Pr\{[\phi_{U'} = 1]\}} = \frac{\Pr\{[\phi_{U'} = 1],[Y = 1]\}}{\Pr\{[\phi_{U'} = 1],[Y = 1]\} + \Pr\{[\phi_{U'} = 1],[Y = 1]^c\}}.$$

Taking into account that $\Pr\{[\phi_U = 1],[Y = 1]^c\} = \Pr\{[\phi_{U'} = 1],[Y = 1]^c\}$ and that
$\frac{a}{a+b} \geq \frac{c}{c+b}$ if $a \geq c$, it is concluded that

$$\frac{\Pr\{[\phi_U = 1],[Y = 1]\}}{\Pr\{[\phi_U = 1]\}} \geq \frac{\Pr\{[\phi_{U'} = 1],[Y = 1]\}}{\Pr\{[\phi_{U'} = 1]\}}.$$

On the other hand,

$$\frac{\Pr\{[\phi_U = 1]^c,[Y = 1]\}}{\Pr\{[\phi_U = 1]^c\}} = \frac{\Pr\{[\phi_U = 1]^c\} - \Pr\{[\phi_U = 1]^c,[Y = 1]^c\}}{\Pr\{[\phi_U = 1]^c\}}$$

and

$$\frac{\Pr\{[\phi_{U'} = 1]^c,[Y = 1]\}}{\Pr\{[\phi_{U'} = 1]^c\}} = \frac{\Pr\{[\phi_{U'} = 1]^c\} - \Pr\{[\phi_{U'} = 1]^c,[Y = 1]^c\}}{\Pr\{[\phi_{U'} = 1]^c\}}.$$

It follows from the assumption $\Pr\{[\phi_U = 1],[Y = 1]^c\} = \Pr\{[\phi_{U'} = 1],[Y = 1]^c\}$ and the following equalities

$$\Pr\{[Y = 1]^c\} = \Pr\{[\phi_U = 1],[Y = 1]^c\} + \Pr\{[\phi_U = 1]^c,[Y = 1]^c\} = \Pr\{[\phi_{U'} = 1],[Y = 1]^c\} + \Pr\{[\phi_{U'} = 1]^c,[Y = 1]^c\}$$

that $\Pr\{[\phi_U = 1]^c,[Y = 1]^c\} = \Pr\{[\phi_{U'} = 1]^c,[Y = 1]^c\}$.

Taking into account that $\Pr\{[\phi_U = 1]^c\} \leq \Pr\{[\phi_{U'} = 1]^c\}$, we obtain

$$\frac{\Pr\{[\phi_U = 1]^c,[Y = 1]^c\}}{\Pr\{[\phi_U = 1]^c\}} \geq \frac{\Pr\{[\phi_{U'} = 1]^c,[Y = 1]^c\}}{\Pr\{[\phi_{U'} = 1]^c\}}$$

and it follows that

$$\frac{\Pr\{[\phi_U = 1]^c,[Y = 1]\}}{\Pr\{[\phi_U = 1]^c\}} \leq \frac{\Pr\{[\phi_{U'} = 1]^c,[Y = 1]\}}{\Pr\{[\phi_{U'} = 1]^c\}}$$

and we conclude that $RR(U',Y) \leq RR(U,Y)$.

□



This property was first pointed out by Jiuyong Li et al. as the antimonotonic property [8]. Jiuyong Li et al. then
suggested using it to obtain an optimal set of association rules by pruning redundant class association rules. Besides
the statement that the relative risks of two nested patterns are sorted in the opposite order to their sizes when their
false-positive rates are equal, it is important to notice that for such risk patterns the largest true-positive rate is
that of the pattern with the smallest size.
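As a purely illustrative check of Proposition 1, with hypothetical counts that do not come from the data analysed in this paper, consider n = 1000 transactions of which 100 belong to the target class. Suppose that the pattern U is matched by 50 transactions (40 in the target class, 10 outside it) and that the nested pattern U' is matched by 40 of them (30 in the target class, 10 outside it), so that both patterns have the same false positives. Replacing probabilities by proportions gives:

```latex
RR(U,Y)  = \frac{40/50}{(100-40)/(1000-50)} = \frac{0.8}{60/950}  = \frac{38}{3} \approx 12.67,
\qquad
RR(U',Y) = \frac{30/40}{(100-30)/(1000-40)} = \frac{0.75}{70/960} = \frac{72}{7} \approx 10.29 .
```

The larger pattern U' has the smaller relative risk, in agreement with the proposition.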



Corollary 1. Let $U = (A_{h,j(h)})_{h \in I}$, $U' = (A_{h,j(h)})_{h \in J}$ and $U'' = (A_{h,j(h)})_{h \in K}$ be three nested
itemsets such that $I \subset J \subset K$. If $U'' \to Y$ is redundant then $U' \to Y$ is automatically redundant.



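Proof. We have: $U$, $U'$ and $U''$ are three nested itemsets, hence

$$\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 1, [Y = 1]^c\right\} \geq \Pr\left\{\prod_{h \in J} A_{h,j(h)} = 1, [Y = 1]^c\right\} \geq \Pr\left\{\prod_{h \in K} A_{h,j(h)} = 1, [Y = 1]^c\right\}.$$

But $U'' \to Y$ redundant implies

$$\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 1, [Y = 1]^c\right\} = \Pr\left\{\prod_{h \in K} A_{h,j(h)} = 1, [Y = 1]^c\right\},$$

from which

$$\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 1, [Y = 1]^c\right\} = \Pr\left\{\prod_{h \in J} A_{h,j(h)} = 1, [Y = 1]^c\right\},$$

and moreover $U \prec U'$.

From Proposition 1, $RR(U',Y) \leq RR(U,Y)$.

□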



Definition 8. Let $\mathcal{F}(Y)$ be a subset of a set of class association rules with $[Y = 1]$ at the right-hand side.
$\mathcal{F}(Y)$ is an optimal family if, whenever $U = (A_{h,j(h)})_{h \in I}$ and $U' = (A_{h,j(h)})_{h \in J}$ are two nested
itemsets such that $I \subset J$, $U \to Y \in \mathcal{F}(Y)$ and $U' \to Y \in \mathcal{F}(Y)$, then $RR(U,Y) < RR(U',Y)$:

$$\mathcal{F}(Y) = \{U \to Y \mid \forall\, U \prec U',\ RR(U,Y) < RR(U',Y)\}$$

The family of optimal rules $\mathcal{F}(Y)$ is non-empty and unique. A narrower condition for pruning is obtained as follows.



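Proposition 2. Let $(A_{h,j(h)})_{h \in I}$ and $(A_{h,j(h)})_{h \in J}$ be two itemsets such that $I \subset J$. If

$$\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 1\right\} = \Pr\left\{\prod_{h \in J} A_{h,j(h)} = 1\right\}$$

then both following equalities hold:

$$\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 1, [Y = 1]\right\} = \Pr\left\{\prod_{h \in J} A_{h,j(h)} = 1, [Y = 1]\right\}$$

$$\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 1, [Y = 1]^c\right\} = \Pr\left\{\prod_{h \in J} A_{h,j(h)} = 1, [Y = 1]^c\right\}$$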



Proof. For simplicity, let $\phi_U = \prod_{h \in I} A_{h,j(h)}$ and $\phi_{U'} = \prod_{h \in J} A_{h,j(h)}$.

We have

$$\Pr\{[\phi_{U'} = 1],[Y = 1]\} + \Pr\{[\phi_{U'} = 1],[Y = 1]^c\} = \Pr\{[\phi_{U'} = 1]\}$$
$$\Pr\{[\phi_U = 1],[Y = 1]\} + \Pr\{[\phi_U = 1],[Y = 1]^c\} = \Pr\{[\phi_U = 1]\}$$

and, by assumption, $\Pr\{[\phi_{U'} = 1]\} = \Pr\{[\phi_U = 1]\}$, therefore

$$\Pr\{[\phi_{U'} = 1],[Y = 1]^c\} - \Pr\{[\phi_U = 1],[Y = 1]^c\} = \Pr\{[\phi_U = 1],[Y = 1]\} - \Pr\{[\phi_{U'} = 1],[Y = 1]\}.$$

We have $\Pr\{[\phi_U = 1],[Y = 1]\} - \Pr\{[\phi_{U'} = 1],[Y = 1]\} \geq 0$ because
$\{[\phi_{U'} = 1],[Y = 1]\} \subset \{[\phi_U = 1],[Y = 1]\}$, therefore

$$\Pr\{[\phi_{U'} = 1],[Y = 1]^c\} - \Pr\{[\phi_U = 1],[Y = 1]^c\} \geq 0 \qquad (1)$$

On the other hand, $\{[\phi_{U'} = 1],[Y = 1]^c\} \subset \{[\phi_U = 1],[Y = 1]^c\}$, therefore

$$\Pr\{[\phi_{U'} = 1],[Y = 1]^c\} - \Pr\{[\phi_U = 1],[Y = 1]^c\} \leq 0 \qquad (2)$$

From (1) and (2) it follows that

$$\Pr\{[\phi_{U'} = 1],[Y = 1]^c\} = \Pr\{[\phi_U = 1],[Y = 1]^c\}$$
$$\Pr\{[\phi_{U'} = 1],[Y = 1]\} = \Pr\{[\phi_U = 1],[Y = 1]\}$$

□



Let us consider two itemsets $U = (A_{h,j(h)})_{h \in I}$ and $U' = (A_{h,j(h)})_{h \in J}$ such that $U$ is nested in $U'$. If
the equality $\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 1\right\} = \Pr\left\{\prod_{h \in J} A_{h,j(h)} = 1\right\}$ holds, the
true-positive rate of the classification function generated by the rule $U \to Y$ is as good as that of the classification
function generated by the rule $U' \to Y$, and the same conclusion holds for the true-negative rates. Therefore the class
association rule $U' \to Y$ should be pruned since it is clearly redundant.

The statement that follows compares two nested risk patterns on the basis of the true-positive rates of the classification
functions they generate.

Proposition 3. Let $U = (A_{h,j(h)})_{h \in I}$ and $U' = (A_{h,j(h)})_{h \in J}$ be two itemsets such that $I \subset J$ and
both association rules $U \to Y$ and $U' \to Y$ are valid. If

$$\Pr\left\{\prod_{h \in I} A_{h,j(h)} = 1, [Y = 1]\right\} = \Pr\left\{\prod_{h \in J} A_{h,j(h)} = 1, [Y = 1]\right\}$$

then $RR(U,Y) \leq RR(U',Y)$.



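Proof. For simplicity, let $\phi_U = \prod_{h \in I} A_{h,j(h)}$ and $\phi_{U'} = \prod_{h \in J} A_{h,j(h)}$.

If $U \prec U'$ then

$$\frac{1}{\Pr\{[\phi_U = 1]\}} \leq \frac{1}{\Pr\{[\phi_{U'} = 1]\}} \quad\text{and}\quad \frac{1}{\Pr\{[\phi_{U'} = 1]^c\}} \leq \frac{1}{\Pr\{[\phi_U = 1]^c\}}.$$

If the condition $\Pr\{[\phi_U = 1],[Y = 1]\} = \Pr\{[\phi_{U'} = 1],[Y = 1]\}$ is true then

$$\frac{\Pr\{[\phi_U = 1],[Y = 1]\}}{\Pr\{[\phi_U = 1]\}} \leq \frac{\Pr\{[\phi_U = 1],[Y = 1]\}}{\Pr\{[\phi_{U'} = 1]\}} = \frac{\Pr\{[\phi_{U'} = 1],[Y = 1]\}}{\Pr\{[\phi_{U'} = 1]\}}.$$

The condition $\Pr\{[\phi_U = 1],[Y = 1]\} = \Pr\{[\phi_{U'} = 1],[Y = 1]\}$ also implies

$$\Pr\{[\phi_U = 1]^c,[Y = 1]\} = \Pr\{[Y = 1]\} - \Pr\{[\phi_U = 1],[Y = 1]\} = \Pr\{[Y = 1]\} - \Pr\{[\phi_{U'} = 1],[Y = 1]\} = \Pr\{[\phi_{U'} = 1]^c,[Y = 1]\}$$

hence

$$\frac{\Pr\{[\phi_U = 1]^c,[Y = 1]\}}{\Pr\{[\phi_U = 1]^c\}} \geq \frac{\Pr\{[\phi_U = 1]^c,[Y = 1]\}}{\Pr\{[\phi_{U'} = 1]^c\}} = \frac{\Pr\{[\phi_{U'} = 1]^c,[Y = 1]\}}{\Pr\{[\phi_{U'} = 1]^c\}}.$$

Dividing the first inequality by the second one gives $RR(U,Y) \leq RR(U',Y)$.

□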
Hence, it comes from the statement above that, when the true-positive rates of the classification functions generated by
two nested risk patterns are equal, the pattern with the smallest size not only has the largest false-positive rate but
also the smallest relative risk. Therefore one can take out the risk pattern with the smallest size since the
classification function associated with it has weak performance indexes.



4 Statistical learning of classification function based on class association
rules mining

Let $\mathcal{D}$ be a dataset gathering n observations, each of them being considered as a transaction, in the idiom of
the association rule learning community. Comparing nested risk patterns for pruning requires statistical hypothesis
testing.

Proposition 4. Let $U = (A_{h,j(h)})_{h \in I}$ and $U' = (A_{h,j(h)})_{h \in J}$ be two nested itemsets where $I \subset J$.
Let $\phi_U = \prod_{h \in I} A_{h,j(h)}$ and $\phi_{U'} = \prod_{h \in J} A_{h,j(h)}$, and let
$\hat\pi_n(U) = \frac{1}{n}\sum_{t \in \mathcal{D}} Y(t)\,\phi_U(t)$, $\hat\pi_n(U') = \frac{1}{n}\sum_{t \in \mathcal{D}} Y(t)\,\phi_{U'}(t)$
and $\pi_1 = \Pr\{[Y = 1]\}$.

For a given positive integer k, consider the statistical hypothesis test that consists in rejecting the null hypothesis
$\Pr([\phi_{U'} = 1],[Y = 1]) = \Pr([\phi_U = 1],[Y = 1])$ if $n\hat\pi_n(U) \geq n\hat\pi_n(U') + k$ and accepting it
otherwise. Its significance level is equal to 0 and its power is asymptotically greater than $1 - \Phi\{u_n\}$, where
$\Phi$ is the Gaussian (cumulative) distribution function and

$$u_n = \frac{\cdots}{\sqrt{\dfrac{(\hat\pi_n(U) - \hat\pi_n(U'))\,(1 + \hat\pi_n(U') - \hat\pi_n(U))}{n}}}.$$

A similar statement holds if the null hypothesis of the test is
$\Pr([\phi_{U'} = 1],[Y = 1]^c) = \Pr([\phi_U = 1],[Y = 1]^c)$.



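Proof. Since $\prod_{h \in I} A_{h,j(h)}$ and $\prod_{h \in J} A_{h,j(h)}$ are Bernoulli variables and
$\left[\prod_{h \in J} A_{h,j(h)} = 1\right] \subset \left[\prod_{h \in I} A_{h,j(h)} = 1\right]$, the differences
$Y(t)\prod_{h \in I} A_{h,j(h)}(t) - Y(t)\prod_{h \in J} A_{h,j(h)}(t)$ are independent Bernoulli trials as t runs through
$\mathcal{D}$. Therefore

$$\sum_{t \in \mathcal{D}} Y(t)\prod_{h \in I} A_{h,j(h)}(t) - \sum_{t \in \mathcal{D}} Y(t)\prod_{h \in J} A_{h,j(h)}(t)$$

is a binomial variable with n Bernoulli trials and success probability
$\pi = \Pr([\phi_U = 1],[Y = 1]) - \Pr([\phi_{U'} = 1],[Y = 1])$. Moreover

$$n\left(\hat\pi_n(U) - \hat\pi_n(U')\right) = \sum_{t \in \mathcal{D}} Y(t)\prod_{h \in I} A_{h,j(h)}(t) - \sum_{t \in \mathcal{D}} Y(t)\prod_{h \in J} A_{h,j(h)}(t)$$

and $\Pr\left\{n\left(\hat\pi_n(U) - \hat\pi_n(U')\right) \geq k\right\} = 0$ for any positive integer k if $\pi = 0$. Thus
the significance level of the test is null. Owing to the central limit theorem,

$$\frac{(\hat\pi_n(U) - \hat\pi_n(U')) - \pi}{\sqrt{\dfrac{(\hat\pi_n(U) - \hat\pi_n(U'))\,(1 + \hat\pi_n(U') - \hat\pi_n(U))}{n}}}$$

is asymptotically a standard Gaussian variable, therefore the probability that it exceeds the value

$$v_n = \frac{\cdots}{\sqrt{\dfrac{(\hat\pi_n(U) - \hat\pi_n(U'))\,(1 + \hat\pi_n(U') - \hat\pi_n(U))}{n}}}$$

is $1 - \Phi\{v_n\} \geq 1 - \Phi\{u_n\}$ since $\pi \leq \pi_1$ and $\Phi$ is an increasing function. □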
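The decision rule of Proposition 4 only compares two counts. The sketch below is an illustration under stated assumptions: the two patterns are summarised by 0/1 match vectors, the function name is hypothetical, and the `target` argument is added so that the same comparison can also be applied to the complementary class, as in the last sentence of the proposition.

```python
def reject_equal_joint_probability(match_u, match_u_prime, labels, k, target=1):
    """Decision rule: reject H0 when n*pi_hat(U) >= n*pi_hat(U') + k.

    match_u / match_u_prime: 0/1 vectors telling which transactions fit
    the nested patterns U and U'; labels: class indicators; k: positive
    integer threshold; target: class on which the joint counts are taken.
    """
    count_u = sum(1 for m, y in zip(match_u, labels) if m == 1 and y == target)
    count_u_prime = sum(1 for m, y in zip(match_u_prime, labels)
                        if m == 1 and y == target)
    return count_u >= count_u_prime + k


# Hypothetical toy vectors: U' is nested in U, so it matches fewer records.
toy_u = [1, 1, 1, 1, 0]
toy_u_prime = [1, 1, 0, 0, 0]
toy_labels = [1, 1, 1, 0, 0]
rejected_k1 = reject_equal_joint_probability(toy_u, toy_u_prime, toy_labels, k=1)
rejected_k2 = reject_equal_joint_probability(toy_u, toy_u_prime, toy_labels, k=2)
```

Here U matches three target-class records and U' two, so the null hypothesis is rejected for k = 1 and accepted for k = 2.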



The statistical learning procedure is made of three stages described hereafter.
Stage 1:

1. Collecting frequent risk patterns

At the first stage, a training dataset is explored in order to generate a set of frequent risk patterns as
candidate risk patterns. This is achieved by using the Apriori algorithm to generate association rules whose
right-hand side itemset is reduced to the indicator function of the target class, and by discarding those for
which the relative risk is below some fixed threshold r.

This step is followed by the pruning of the redundant risk patterns on the one hand and of the risk patterns
which generate weak classification functions on the other hand.

2. Pruning the redundant risk patterns

Let $L_1$ be the maximum size of the candidate risk patterns and fix $l = L_1 + 1$.

(a) Update $l$ by $l - 1$.

(b) Let $S_1$ be the subset of the risk patterns of size $l$ and $S_2$ the subset of the risk patterns of size $l - 1$.

(c) For each risk pattern $U' \in S_1$:

• perform a statistical hypothesis test where the null hypothesis
$\Pr([\phi_{U'} = 1],[Y = 1]^c) = \Pr([\phi_U = 1],[Y = 1]^c)$, with $U \in S_2$ nested in $U'$, is considered against
its opposite;

• discard the risk pattern $U'$ if the null hypothesis is accepted for some $U$.

(d) Repeat the three steps above as long as $l > 2$.

3. Pruning the risk patterns

Let $L_2$ be the maximum size of the candidate risk patterns and fix $l = 2$.

(a) Update $l$ by $l - 1$.

(b) Let $S_1$ be the subset of the risk patterns of size $l$ and $S_2$ the subset of the risk patterns of size $l + 1$.

(c) For each risk pattern $U \in S_1$:

• perform a statistical hypothesis test where the null hypothesis
$\Pr([\phi_U = 1],[Y = 1]) = \Pr([\phi_{U'} = 1],[Y = 1])$, with $U' \in S_2$ containing $U$, is considered against
its opposite;

• discard the risk pattern $U$ if the null hypothesis is accepted for some $U'$.

(d) Repeat the three steps above as long as $l < L_2$.

The output of this stage of the procedure is a smaller set of class association rules $\mathcal{F}(Y)$ which are
potentially non-redundant.
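The pruning of redundant risk patterns (step 2 of Stage 1) can be sketched as a loop over decreasing pattern sizes. The code below is a simplified illustration, not the procedure actually run for this paper: patterns and transactions are encoded as sets of items, only pairs where the smaller pattern is nested in the larger one are compared (an assumption), and the null hypothesis is taken as accepted when the false-positive count of U stays below that of U' plus k, following the decision rule of Proposition 4. The function name and the toy data are hypothetical.

```python
def prune_redundant(patterns, transactions, labels, k=1):
    """Drop U' when, for some nested U, equality of false positives is accepted.

    patterns: non-empty iterable of frozensets; transactions: list of
    frozensets; labels: 0/1 class indicators; k: threshold of the test.
    """
    def false_positives(pattern):
        return sum(1 for t, y in zip(transactions, labels)
                   if pattern <= t and y == 0)

    kept = set(patterns)
    max_len = max(len(p) for p in kept)
    for size in range(max_len, 1, -1):            # sizes L1, L1 - 1, ..., 2
        larger = [p for p in kept if len(p) == size]
        smaller = [p for p in kept if len(p) == size - 1]
        for u_prime in larger:
            for u in smaller:
                # H0 is accepted when count(U) < count(U') + k
                if u < u_prime and false_positives(u) < false_positives(u_prime) + k:
                    kept.discard(u_prime)
                    break
    return kept


# Hypothetical toy data: {a, b} adds nothing to {a}; {a, c} removes a false positive.
toy_transactions = [frozenset("ab"), frozenset("ab"), frozenset("ac"), frozenset("a")]
toy_labels = [0, 1, 1, 1]
toy_patterns = [frozenset("a"), frozenset("ab"), frozenset("ac")]
toy_kept = prune_redundant(toy_patterns, toy_transactions, toy_labels, k=1)
```

On the toy data the pattern {a, b} is discarded because it has as many false positives as {a}, while {a, c} is kept.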

Stage 2:

The second stage of the statistical learning procedure is performed on a validation dataset. The family
$\mathcal{F}(Y)$ of class association rules produced at the previous stage is screened in order to select
representative risk patterns. Then a classification function is built by combining these class association rules in
such a way that an observation is classified as positive if it fits at least one risk pattern, and as negative
otherwise.

1. The first step in this stage consists in updating the estimate of the relative risk of each pattern.

2. After that, one proceeds as follows to constitute the set of representative risk patterns:
For each record in the target class:

• identify all the risk patterns that describe the record;

• select as a member of the set of representative risk patterns one of the risk patterns whose relative risk is
maximum and which was not already retained.
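The selection of representative risk patterns can be sketched as a single pass over the target-class records. The reading adopted here is an assumption: for each record, the pattern with the highest relative risk is chosen among the matching patterns that have not been retained yet. The function name, the data encoding and the toy values are hypothetical.

```python
def select_representatives(patterns_rr, target_records):
    """Pick, for each target-class record, the best pattern not yet retained.

    patterns_rr: dict mapping each pattern (frozenset of items) to its
    relative risk re-estimated on the validation data;
    target_records: list of frozensets, the records of the target class.
    """
    selected = []
    for record in target_records:
        matching = [p for p in patterns_rr
                    if p <= record and p not in selected]
        if matching:
            selected.append(max(matching, key=lambda p: patterns_rr[p]))
    return selected


# Hypothetical toy values.
toy_patterns_rr = {frozenset("a"): 5.0, frozenset("ab"): 9.0, frozenset("c"): 7.0}
toy_records = [frozenset("ab"), frozenset("abc"), frozenset("d")]
toy_selected = select_representatives(toy_patterns_rr, toy_records)
```

The first record selects {a, b}; the second one, which also fits {a, b}, selects {c} because {a, b} is already retained; the third record fits no pattern.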

Stage 3:

The final stage of the procedure deals with the assessment of the classification function and is performed on a
test dataset. The performance of the classifier is assessed using the sensitivity and specificity statistics. The
sensitivity is the expected proportion of target-class sample units that are classified as positive (true positives),
while the specificity is the expected proportion of the other sample units that are classified as negative
(true negatives).
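The classification function of Stage 2 and its assessment in Stage 3 can be sketched together. This is a minimal sketch assuming records and patterns are encoded as sets of items and that the test sample contains both classes; the function names and the toy data are hypothetical.

```python
def classify(record, patterns):
    """Return 1 when the record fits at least one risk pattern, else 0."""
    return int(any(p <= record for p in patterns))


def sensitivity_specificity(patterns, transactions, labels):
    """Estimate sensitivity and specificity on a test sample."""
    predictions = [classify(t, patterns) for t in transactions]
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy test sample.
toy_patterns = [frozenset("ab"), frozenset("c")]
toy_test = [frozenset("ab"), frozenset("c"), frozenset("a"),
            frozenset("c"), frozenset("d"), frozenset("a")]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_sens, toy_spec = sensitivity_specificity(toy_patterns, toy_test, toy_labels)
```

On the toy sample two of the three target-class records are detected and two of the three other records are correctly classified as negative.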



5 Application to in-hospital maternal mortality in Senegal and Mali
5.1 The dataset

The data under consideration in this application were gathered in a randomized controlled trial (the QUARITE trial).
The hospital is the unit of randomization and intervention while the patient admitted for childbirth is the
unit of analysis. The data were obtained during the pre-intervention phase, which took place from October 1st, 2007 to
September 30th, 2008 in Senegal and from November 1st, 2007 to October 31st, 2008 in Mali. Only the patients' data are
analysed in this paper. The 89518 patients of the available sample are described by 25 variables split into three groups:
a first group of seven variables describing the patient's status before the current pregnancy, a second group of eleven
variables dealing with the progress of the pregnancy and a third group of seven variables describing the course of delivery.
The binary target variable takes the value 1 if the patient died before being able to leave the hospital (617 patients)
and 0 otherwise (88901 patients). The data analysis has two goals: (1) identify the characteristic patterns of patients
who died; (2) state a classification rule that performs well and is as easy to understand as a decision tree or a
logistic regression.
Variables                                                      Codes values

Previous medical history
  Age (group 16y and less, 17-34, 35y and more)
  Parity (nulliparous, 1-4, 5 and more)                        1 2 3
  Chronic arterial hypertension                                0, 1
  Chronic cardiac/renal disease                                1
  Chronic pulmonary disease                                    1
  Sickle cell disease                                          1
  Previous caesarean section                                   1

Current pregnancy
  Gestational hypertension                                     0, 1
  Pre-eclampsia/eclampsia                                      0, 1
  Vaginal bleeding (near the term)                             0, 1
  Severe chronic anaemia                                       0, 1
  Gestational diabetes                                         1
  Premature rupture of the membranes                           0, 1
  Urinary tract infection/pyelonephritis                       0, 1
  HIV/AIDS                                                     1
  Malaria                                                      1
  Multiple pregnancy                                           1
  Antenatal care attendance (no visit, 1-3, 4 and more)        1 2 3

Labour and delivery
  Referral from another health facility                        0, 1
  Labour induction                                             1
  Mode of delivery
    normal vaginal                                             1
    forceps / vacuum                                           2
    emergency antepartum cesarean delivery                     3
    intrapartum cesarean delivery                              4
    elective cesarean delivery                                 5
  Ante- or immediate postpartum haemorrhage                    0, 1
  Prolonged/obstructed labour                                  0, 1
  Uterine rupture                                              0, 1

Table 1: list of input variables



5.2 Results and discussion 
5.2.1 Building the classifier 

All analyses related to the proposed classification method were performed in the R programming environment.
Association rule mining was done using the package arules [6][1]. Classification rules (classifiers) were obtained by
setting the algorithm's parameters as follows:

• maximum length of risk pattern: 3 or 4

• threshold for local support: 9%, 10% or 15%

• ratio of confidence to frequency of death: 3, 4 or 5

Combining these parameters results in eighteen classifiers. The best classifier is selected from this set by
examining the variation of the sensitivity with respect to the specificity. The performance measures associated with
the 18 classification rules are gathered in the following table.
num   LocSupp   MinConf   Maxlhs   Sensitivity   Specificity   Classification Error
  1     9%        3         3        0.925         0.684         0.317
  2     9%        3         4        0.925         0.686         0.309
  3     9%        4         3        0.824         0.816         0.247
  4     9%        4         4        0.884         0.760         0.247
  5     9%        5         3        0.739         0.867         0.136
  6     9%        5         4        0.739         0.870         0.135
  7    10%        3         3        0.925         0.684         0.317
  8    10%        3         4        0.925         0.686         0.309
  9    10%        4         3        0.824         0.816         0.186
 10    10%        4         4        0.884         0.760         0.248
 11    10%        5         3        0.739         0.867         0.136
 12    10%        5         4        0.739         0.870         0.136
 13    15%        3         3        0.935         0.679         0.314
 14    15%        3         4        0.930         0.685         0.314
 15    15%        4         3        0.829         0.815         0.189
 16    15%        4         4        0.879         0.763         0.251
 17    15%        5         3        0.734         0.867         0.137
 18    15%        5         4        0.734         0.875         0.137

Table 2: Performance measures of the eighteen classifiers
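The eighteen parameter combinations of Table 2 are the Cartesian product of the three parameter lists. As an illustration only (the analyses of this paper were run in R, and the dictionary keys below are hypothetical names), the grid can be enumerated in the same row order as the table:

```python
from itertools import product

local_supports = [0.09, 0.10, 0.15]   # thresholds for local support
conf_ratios = [3, 4, 5]               # ratio of confidence to frequency of death
max_lengths = [3, 4]                  # maximum length of a risk pattern

# One dictionary per classifier, ordered as the rows of Table 2.
grid = [
    {"loc_supp": s, "min_conf_ratio": c, "max_lhs": m}
    for s, c, m in product(local_supports, conf_ratios, max_lengths)
]
```

The ninth entry of `grid` corresponds to classifier number 9 (local support 10%, ratio 4, maximum length 3).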



• Selection by the ROC curve

The representative set of risk patterns (classifier) is designed to produce only a class decision, i.e., "positive class"
or "negative class", on each record. It is said to be a discrete binary classifier [5]. When such a discrete classifier
is applied to a test set, it yields a single confusion matrix, which in turn corresponds to one ROC point. Thus, a
discrete classifier produces only a single point in ROC space (see Fig. 1). A ROC graph depicts relative trade-offs
between sensitivity and specificity. It helps us to choose the classifier that is optimal among the eighteen
classifiers. Informally, one point in ROC space is better than another if it is to the northwest of it (sensitivity
is higher, 1 - specificity is lower, or both). In Fig. 1, the optimal classifier corresponds to the point with
specificity equal to 0.816 and sensitivity equal to 0.824, which is classifier number 9 in Tab. 2.
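The informal "northwest" comparison can be written as a dominance test between ROC points. The sketch below is an illustration and not the selection rule used in this paper: it only removes the points that are dominated by another one, and the final choice among the remaining points still requires a judgement on the trade-off. The function names and the toy points are hypothetical.

```python
def dominates(p, q):
    """True when ROC point p is to the northwest of q.

    Points are (sensitivity, specificity) pairs: p dominates q when it is
    at least as good on both measures and differs from q.
    """
    return p[0] >= q[0] and p[1] >= q[1] and p != q


def non_dominated(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (sensitivity, specificity) pairs.
toy_points = [(0.9, 0.6), (0.8, 0.8), (0.7, 0.7), (0.6, 0.9)]
toy_front = non_dominated(toy_points)
```

In the toy example the point (0.7, 0.7) is removed because (0.8, 0.8) is better on both measures.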
Figure 1: ROC graph for comparison of performance measures (horizontal axis: 1 - specificity)

5.2.2 Tree-structured presentation of risk patterns that defines the optimal classifier 

A tree structure can be used to visualise the rules of the representative set of risk patterns that defines the optimal
classification function. This tree structure allows us to present the classification rule in a form as easy to
understand as a decision tree. Each branch of the tree constitutes a risk pattern whose relative risk is given at the
terminal node of the branch.
Figure 2: Tree representation of rules mined (root: total population; main nodes include referral from another health
facility, emergency antepartum cesarean delivery, no antenatal care attendance and ante- or immediate postpartum
haemorrhage; each terminal node gives the relative risk RR of its branch, from 4.93 to 30.70)
5.2.3 Discussion 



For clinical use, the tree structure shown in Figure 2 can be used to visualize the rules mined. Each branch of 
the tree constitutes a risk pattern whose relative risk of in-hospital mortality is given at the terminal node of the branch. 
Each node represents a variable-value pair. Immediately below each node, we show the next split of the 
sub-population assigned to that branch, which is based on a new variable-value pair. For example, according to Figure 
2, "patients who have haemorrhage, no vaginal bleeding and are referred from another health facility" are 30.70 times 
more likely to die than the population average. For the patients who have "haemorrhage, no vaginal bleeding and parity 
between 1 and 4", the relative risk (RR) of in-hospital death is 28.20, and for the patients who are referred 
from another health facility and have a uterine rupture and a prolonged/obstructed labour, the RR is 27.11. 
A similar interpretation can be made for each main node of the tree; these nodes identify, respectively, patients 
with no antenatal care attendance, patients with emergency caesarean section and patients referred from other health facilities. 
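
Reading the tree for a given patient amounts to checking which risk patterns the patient's record satisfies. The sketch below illustrates that lookup only; it does not reproduce how the classification function itself is derived from the patterns. The variable names (`haemorrhage`, `vaginal_bleeding`, `referred`, and so on), the helper `matching_patterns` and the patient record are hypothetical; the three patterns and their RR values are the ones quoted above.

```python
def matching_patterns(record, patterns):
    """Return the patterns whose conditions all hold for the record, highest RR first."""
    hits = [
        (conds, rr)
        for conds, rr in patterns
        if all(record.get(var) == val for var, val in conds.items())
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Patterns quoted in the text, encoded as variable-value pairs (names are assumptions).
patterns = [
    ({"haemorrhage": True, "vaginal_bleeding": False, "referred": True}, 30.70),
    ({"haemorrhage": True, "vaginal_bleeding": False, "parity_1_to_4": True}, 28.20),
    ({"referred": True, "uterine_rupture": True, "prolonged_labour": True}, 27.11),
]

# Hypothetical patient record, for illustration only.
patient = {
    "haemorrhage": True,
    "vaginal_bleeding": False,
    "referred": False,
    "parity_1_to_4": True,
    "uterine_rupture": False,
    "prolonged_labour": False,
}

hits = matching_patterns(patient, patterns)
for conds, rr in hits:
    print(sorted(conds), rr)
```

For this record only the second pattern applies, since the patient was not referred from another health facility.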

The classification rule established in this study confirms that patients with uterine rupture, haemorrhage, prolonged/obstructed 
labour or parity between 1 and 4 should be managed with high priority by qualified health professionals in comprehensive 
EmOC services [3], especially if the patient is referred from another health facility. Given the human resources crisis 
in Mali and Senegal, the availability of qualified personnel (midwives and doctors) is problematic and many tasks are 
delegated to less qualified health personnel (students, matrons, nurse-assistants). These professionals may play a crucial 
role in improving maternal outcomes in referral hospitals if they are involved in appropriate tasks and are adequately 
trained. Specifically, our results indicate that they should be trained to detect uterine rupture and haemorrhage. The 
required tasks and actions are quite specific and simple: ask about pain and contractions as well as vaginal bleeding, 
measure blood pressure, perform a protein dipstick test, and detect excessive blood loss and seizures. Even non-qualified 
health personnel could detect, at admission or during labour/delivery, the following alarm signs: acute pain and loss of 
contractions, blood pressure > 140/90 mmHg, proteinuria > 1+, haemorrhage; they should then immediately alert qualified 
professionals if any of these signs is detected. The early detection of these signs of complication, and immediate 
management by midwives or doctors, would improve maternal outcomes. 

The rules identified by the association classification algorithm provide useful knowledge to health care professionals in 
referral hospitals in Mali and Senegal, and can serve as a reference in their decision to manage patients delivering in 
their health facilities. 

5.2.4 Comparison with alternative methods 

Bagging is a common way to improve a classification task on the basis of a logistic regression model or a classification tree. 
It consists in combining classifiers learned on balanced bootstrap samples formed by using a weighted re-sampling scheme. 
One obtains classifiers that have better performance measures (sensitivity and specificity), but not all the interesting 
features of the regression models can be recovered easily, such as the ability to evaluate the contribution of each covariate 
to the variation of the target class conditional probability or to identify the attribute patterns that are correlated with 
the target class. 
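
The idea of bagging on balanced bootstrap samples can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure of [7]: equal-size per-class resampling stands in for the weighted re-sampling scheme, whose weights are not given here; the base learner is a one-feature threshold rule rather than a logistic regression or a tree; and the data are synthetic. All function names are hypothetical.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Draw, with replacement, the same number of indices from each class."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n = min(len(idx) for idx in by_class.values())  # size of the rare class
    sample = []
    for idx in by_class.values():
        sample.extend(rng.choices(idx, k=n))
    return sample


def fit_stump(xs, ys):
    """Toy base learner: threshold halfway between the two class means."""
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    threshold = (mean0 + mean1) / 2
    sign = 1 if mean1 >= mean0 else -1
    return lambda x: 1 if sign * (x - threshold) > 0 else 0


def bagging_predict(models, x):
    """Majority vote over the bagged classifiers."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
# Synthetic, illustrative data only: 20 rare-class cases among 1000.
xs = [rng.gauss(2.0, 1.0) for _ in range(20)] + [rng.gauss(0.0, 1.0) for _ in range(980)]
ys = [1] * 20 + [0] * 980

models = []
for _ in range(25):  # an odd number of models avoids tied votes
    idx = balanced_bootstrap(ys, rng)
    models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

print(bagging_predict(models, 3.0), bagging_predict(models, -1.0))
```

Because each base learner sees as many rare-class as prevalent-class cases, none of them can reach a low error simply by ignoring the rare class.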

According to Table 3, the bagging decision tree is less efficient than the classification association rules model. Only the 
bagging logistic regression model [7] is comparable to the classification association rules model. The bagging logistic 
regression model has a greater true-negative rate than the classification association rules model (86.3% [85.98-86.62] vs 
81.6% [81.16-82.04]) and a smaller global error (13.7% vs 18.4%). The advantage 
of classification association rules is that they make it possible to determine risk patterns and to present them in the form 
of a tree that is easy to grasp in practical decision-making situations (Figure 2), which is not possible with the bagging 
logistic regression model. 
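
The bracketed intervals are not defined in this section. They are consistent with a normal-approximation (Wald) 95% interval computed from the rounded true-negative rate and the number of observed "Not Death" cases in Table 3; that reading is an assumption, and the sketch below (with the hypothetical helper `wald_ci`) only checks it numerically.

```python
import math


def wald_ci(p, n, z=1.96):
    """Normal-approximation confidence interval for a proportion p estimated on n cases."""
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width


# Observed "Not Death" counts from Table 3: 5454 + 24186 and 6085 + 38365.
cases = {
    "association rules": (0.816, 5454 + 24186),
    "bagging logistic regression": (0.863, 6085 + 38365),
}

for name, (specificity, n_negative) in cases.items():
    low, high = wald_ci(specificity, n_negative)
    print(f"{name}: {100 * specificity:.1f}% [{100 * low:.2f}-{100 * high:.2f}]")
```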



Models                        Predicted    Observed                       Sensitivity  Specificity  Global error 
                                           Death  Not Death     Total 

Association rules             Death          164       5454      5618     0.824        0.816        0.184 
                              Not Death       35      24186     24221 

Bagging logistic regression   Death          254       6085      6339     0.822        0.863        0.137 
                              Not Death       55      38365     38420 

Bagging decision tree         Death          233      10643     10876     0.754        0.760        0.239 
                              Not Death       76      33807     33883 

Table 3: Performance measures for the two alternative classification methods vs the class association rule method 
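
The three performance measures in Table 3 follow directly from the confusion matrices. The short sketch below recomputes them from the counts in the table, using the usual definitions of sensitivity, specificity and global error; the function name `metrics` is illustrative.

```python
def metrics(tp, fp, fn, tn):
    """Sensitivity, specificity and global error from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # deaths correctly predicted
    specificity = tn / (tn + fp)  # survivors correctly predicted
    global_error = (fp + fn) / (tp + fp + fn + tn)
    return sensitivity, specificity, global_error


# Counts from Table 3, in the order (tp, fp, fn, tn):
# predicted Death row = (tp, fp), predicted Not Death row = (fn, tn).
table3 = {
    "Association rules": (164, 5454, 35, 24186),
    "Bagging logistic regression": (254, 6085, 55, 38365),
    "Bagging decision tree": (233, 10643, 76, 33807),
}

for name, counts in table3.items():
    se, sp, err = metrics(*counts)
    print(f"{name}: sensitivity={se:.3f} specificity={sp:.3f} global error={err:.3f}")
```

The recomputed values agree with the table to within rounding in the third decimal place. Note that the association rules matrix sums to 29839 cases, whereas the two bagging matrices sum to 44759.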



6 Concluding remarks 

This paper advocates a methodology for building a binary classification function when the target class is a rare event. 
Assuming that a large amount of data is available, this goal is achieved 
by resorting to association rules to explore the data and identify the patterns that are correlated with the 
target class. Relevant patterns are selected on the basis of their relative risk, their true-positive rates and their 
true-negative rates. The procedure makes it possible to overcome the shortcoming of regression methods, which underestimate 
the conditional probability of occurrence of the target class when the frequency of the instances belonging to this class is 
very low. Moreover, the patterns of attribute interactions that are highly correlated with the target class are made 
explicit, so the classification function does not appear as a black box. Nevertheless, a data preprocessing stage 
is needed before performing the procedure, since the covariates are assumed to be measured on a non-numerical scale. 
The effectiveness of the proposed method is shown by its application to real-world data from a study of in-hospital 
maternal mortality. 